Abstract

In many applications data is naturally presented in terms of orderings of some basic
elements or symbols. Reasoning about such data requires a notion of similarity capable of
handling sequences of different lengths. In this paper we describe a family of Mercer kernel
functions for such sequentially structured data. The family is characterized by a
decomposable structure in terms of symbol-level and structure-level similarities,
representing a specific combination of kernels which allows for efficient computation. We
provide an experimental evaluation on sequential classification tasks comparing kernels
from our family to a state-of-the-art sequence kernel, the Global Alignment kernel, which
has been shown to outperform Dynamic Time Warping.

Keywords: Kernel, Sequences 


1. Introduction 


Many types of data have inherent sequential structure. Sequences of letters in
computational linguistics, series of images in computer vision or cell structures in
computational biology, and arbitrary data sets depending on a parameter such as time provide
familiar examples of such data. It is hence not surprising that there exists a significant
amount of work focused on representing such data. In (Rieck, 2011) the author reviews and
broadly categorizes sequential similarity measures into three main categories: bag-of-words,
edit-distance and string-kernel based methods. Bag-of-words (Harris, 1970) based similarity
measures translate the notion of a sequence to a distribution over certain sub-sequences
(i.e. words in natural language processing) of the sequence itself, meaning that such
measures only encode the sequential structure up to the length of the sub-sequence and
disregard information about word order. As such, bag-of-words methods require us to be able
to identify significant sub-sequences (the words), which is not always obvious for sequences
arising outside natural language. Nevertheless, this approach captures some structure and,
as the sequential data is translated into a vector space whose basis consists of elementary
subsequences, it allows us to interpret the data and enables us to use well-developed
learning methods for such vectorial data. Techniques based on edit distances (Damerau, 1964;
Levenshtein, 1966) relate sequences by defining a transformation from one sequence to the
other and associating a cost to the transformation. Edit distances can be very useful if the
notion of cost with respect to different transformations is well grounded. The third
category refers to (dis)similarity measures defined by implicitly specifying an
inner-product space through a kernel function between sequences. String kernels (Lodhi et
al., 2002; Rousu and Shawe-Taylor, 2006) were proposed in computational linguistics, where
data consists of sequences (text) of discrete symbols (letters). The (dis)similarity measure
is defined in terms of “gaps” between symbols in the two sequences. String kernels are a
specific instance of a larger class of kernel functions referred to as rational kernels
(Cortes et al., 2004). Rational kernels are related to weighted automata (Mohri, 2009) and
define inner products from the specific sequential structure described by the automata. In
this paper, we will focus on a new family of kernel-based (dis)similarity measures.

The contributions of this work are in particular: a) A generic approach for the
construction of sequence kernels which scales as $O(nm)$ in the lengths $n, m$ of the input
strings. b) The kernel decomposes intuitively into structure-level and symbol-level
similarities. Compared to previous approaches, the structure of the symbol space can be
encoded by any Mercer kernel. c) We show that a recently proposed intuitive (dis)similarity
measure on sequences (Baisero et al., 2013) is a positive definite kernel and falls into our
class. d) We compare and evaluate several kernels from our family which perform favourably
against the state-of-the-art Global Alignment kernel.


2. Related work 

The three main sequence similarity approaches discussed above are all based on the concept
that sequence similarity is defined in terms of discrete unordered symbols, and the
similarity between two symbols $a, b \in \Sigma$ is typically defined by zero if $a \neq b$
and one otherwise. However, for many types of data the symbol space $\Sigma$ might be
continuous, and we might in fact have a natural similarity measure on $\Sigma$ itself. As an
example, consider the problem of matching two discretized waveforms
$\alpha = [\alpha_1, \dots, \alpha_n]$, $\beta = [\beta_1, \dots, \beta_m]$ where
$\alpha_i, \beta_j \in \mathbb{R} = \Sigma$ and where there exists a natural distance
$\|a - b\|$ for $a, b \in \Sigma = \mathbb{R}$. A popular similarity measure closely related
to edit distances is Dynamic Time Warping (Sakoe and Chiba, 1978; Müller, 2007). It provides
a similarity measure based on the cost of aligning two sequences such that the sum of the
costs of matching each element is minimized. This measure does not by itself correspond to a
positive definite kernel function (Bahlmann et al., 2002) and hence lacks a geometrical
interpretation. One approach has been to use the dynamic time warping distance inside a
radial basis exponential kernel function (Lei and Sun, 2007; Bahlmann et al., 2002).
However, this still suffers from the drawback that dynamic time warping is not a kernel
itself. Even though non-positive kernels have been shown to be useful in practice (Haasdonk,
2005), they lack a geometrical interpretation and the mathematical justification which makes
the use of kernel methods so appealing.
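To make the dynamic time warping cost discussed above concrete, the following is a minimal
sketch of the standard dynamic-programming recursion. It assumes scalar symbols and the
absolute difference as the symbol-level distance; the name `dtw_cost` and the default
distance are illustrative choices and are not taken from the cited works.

```python
def dtw_cost(a, b, dist=lambda x, y: abs(x - y)):
    """Cost of the cheapest monotone alignment of sequences a and b.

    D[i][j] holds the minimal cost of aligning the prefixes a[:i] and b[:j].
    """
    n, m = len(a), len(b)
    inf = float("inf")
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Extend the cheapest of the three admissible predecessor alignments.
            D[i][j] = dist(a[i - 1], b[j - 1]) + min(
                D[i - 1][j], D[i][j - 1], D[i - 1][j - 1]
            )
    return D[n][m]


# Sequences of different length that differ only by a repeated symbol.
print(dtw_cost([0, 1, 2], [0, 1, 1, 2]))
```

The returned value is a minimum over alignments. Taking such a minimum is what prevents the
measure from being a positive definite kernel in general, which is the limitation referred
to in the text.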


Motivated by the intuition for the definition of dynamic time warping, (Cuturi et al.,
2007) developed a related similarity measure which in fact corresponds to a valid kernel
function for sequences. Here a (dis)similarity function is defined by summarizing all possible
alignments between two sequences through a ‘soft-min’ rather than using only the minimal
cost alignment as in dynamic time warping. Importantly, compared to previous kernels on
sequences, this kernel is capable of incorporating a structured non-discrete symbol space
$\Sigma$. The resulting kernel is referred to as the Global Alignment kernel and was shown to
outperform Dynamic Time Warping for sequence classification. However, to be a valid
Mercer kernel, the structure of the symbol space $\Sigma$ has to be induced by a specific class
of kernel functions. Further, it strongly favors small sequence perturbations over larger
perturbations, which reduces the ability of the kernel to generalize from example data. Some
of these issues have been addressed in (Cuturi, 2010) where only a subset of the possible
alignments contributes to the inner product.


Another approach was taken in (Baisero et al., 2013), where we proposed a (dis)similarity
measure called the Path kernel. Just like the Global Alignment kernel, this kernel is defined
by reasoning about the (dis)similarity of all possible alignments of two sequences. In a set
of experiments, the Path kernel performed better than the Global Alignment kernel both with
respect to accuracy and computational cost. We will show that this kernel naturally falls
into a class of kernels that we will define in this work, thus proving that it is positive
semi-definite.¹


3. On the construction of sequence kernels 

We are interested in finite sequences $s = (s_1, \dots, s_{|s|})$, with symbols
$s_i \in \Sigma$, belonging to a symbol space $\Sigma$ which can be discrete or continuous.
We denote the set of such finite sequences by $\mathrm{Seq}(\Sigma)$ and are interested in
studying combinations of Mercer kernel functions on symbols
$k_\Sigma : \Sigma \times \Sigma \to \mathbb{R}$ that yield valid Mercer kernels on a
sequential level $\mathrm{Seq}(\Sigma)$. We follow the convention of calling a kernel
$k : X \times X \to \mathbb{R}$ positive definite if
$\sum_{i,j=1}^{n} c_i k(x_i, x_j) c_j \ge 0$ for any finite subset
$\{x_1, \dots, x_n\} \subseteq X$, $n \in \mathbb{N}$ and any
$\{c_1, \dots, c_n\} \subseteq \mathbb{R}$. Let us now describe a novel general approach
towards the construction of such kernels for sequences belonging to $\mathrm{Seq}(\Sigma)$:

Lemma 3.1 Let $k_\Sigma : \Sigma \times \Sigma \to \mathbb{R}$ be a continuous positive
definite kernel on $\Sigma$, where $\Sigma$ is a separable metric space and let
$k_S : \mathbb{N} \times \mathbb{N} \to \mathbb{R}$ be a positive definite kernel on
integers. Then the kernel

$$k(s, t) = \sum_{i=1}^{|s|} \sum_{j=1}^{|t|} k_\Sigma(s_i, t_j)\, k_S(i, j), \tag{1}$$

defined for any finite sequences $s, t \in \mathrm{Seq}(\Sigma)$,
$s = (s_1, \dots, s_{|s|})$ and $t = (t_1, \dots, t_{|t|})$, is also positive definite.
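The double sum in Lemma 3.1 translates directly into code. The sketch below is a minimal
illustration under stated assumptions: symbols are real numbers, the symbol kernel is a
Gaussian and the structure kernel is a squared-exponential on positions. The function names
and the parameters `width` and `sigma` are hypothetical and are not prescribed by the text.

```python
import math


def sequence_kernel(s, t, k_sym, k_struct):
    """Sum k_sym(s_i, t_j) * k_struct(i, j) over all position pairs (1-based)."""
    total = 0.0
    for i, s_i in enumerate(s, start=1):
        for j, t_j in enumerate(t, start=1):
            total += k_sym(s_i, t_j) * k_struct(i, j)
    return total


# Illustrative kernel choices (assumptions made for this sketch).
def gaussian_symbol_kernel(a, b, width=1.0):
    return math.exp(-((a - b) ** 2) / (2.0 * width ** 2))


def exponential_structure_kernel(i, j, sigma=2.0):
    return math.exp(-((i - j) ** 2) / sigma ** 2)


s = [0.0, 0.5, 1.0]
t = [0.1, 0.4, 0.8, 1.2]  # a sequence of different length
print(sequence_kernel(s, t, gaussian_symbol_kernel, exponential_structure_kernel))
```

Each evaluation visits every pair of positions once, so the cost is proportional to the
product of the two sequence lengths.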


Proof Observe that both $k_\Sigma$ and $k_S$ can be trivially extended to kernels on
$\Sigma \times \mathbb{N}$ by $K_1((s_i, i), (t_j, j)) = k_\Sigma(s_i, t_j)$ and
$K_2((s_i, i), (t_j, j)) = k_S(i, j)$ for $s_i, t_j \in \Sigma$ and $i, j \in \mathbb{N}$.
Since the pointwise product of two positive definite kernels is again positive definite,
$K((s_i, i), (t_j, j)) = K_1((s_i, i), (t_j, j))\, K_2((s_i, i), (t_j, j))$ is a positive
kernel on $U = \Sigma \times \mathbb{N}$. Let $X, Y$ be finite subsets of $U$. According to
Lemma 1, (Haussler, 1999), the kernel

$$l(X, Y) = \sum_{x \in X,\, y \in Y} K(x, y)$$

is then also positive definite. Note that a sequence $s = (s_1, s_2, \dots, s_n)$
corresponds to a subset $X = \{(s_1, 1), (s_2, 2), \dots, (s_n, n)\}$, and thus the above
kernel is positive definite. ■

1. Note that the Path kernel defined in (Baisero et al., 2013) is not related to the special
Rational kernel proposed in (Takimoto and Warmuth, 2003), which is also referred to as a
Path kernel.

Note that any countable discrete space and any finite dimensional vector space can be given
the structure of a separable metric space. If $\Sigma$ is discrete and countable, any kernel
on $\Sigma$ is trivially continuous with respect to the discrete topology. Note also that,
while the above result readily follows from the work on convolution kernels by (Haussler,
1999), the above natural class of kernels has, to the best of our knowledge, not been
studied or formulated in this manner. This might be partly because classical kernels coming
from natural language processing often only consider similarity measures on the symbol space
$\Sigma$ directly.

We observe that the family of kernels described above relates all pairs of the input
sequences’ symbols using $k_\Sigma$ and adjusts these values according to the similarity of
the positions of the symbols within the sequences, as measured by $k_S$. An added benefit of
the proposed family of kernels is their relative computational simplicity, since kernel
evaluations scale like $O(|s||t|)$ in the lengths of the input strings $s, t$. Noting that,

$$k(s, t) = \operatorname{tr}\left(K_\Sigma(s, t)^T K_S(|s|, |t|)\right), \tag{2}$$

where $[K_\Sigma(s, t)]_{ij} = k_\Sigma(s_i, t_j)$, $[K_S(|s|, |t|)]_{ij} = k_S(i, j)$,
$i = 1, \dots, |s|$, $j = 1, \dots, |t|$ and tr denotes the trace, we observe that the
matrix $K_S$ can be pre-computed once the maximal length of any sequence in a data-set is
known. The evaluation of the kernel is then just a trace of a matrix product which can be
efficiently implemented.
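The pre-computation described above can be sketched as follows. The structure matrix is
built once for the longest sequence in the data-set, and each kernel evaluation then reduces
to the sum of elementwise products over its top-left $|s| \times |t|$ block, which equals the
trace in Eq. 2. The helper names, the linear symbol kernel and the toy data are assumptions
made for illustration only.

```python
import math


def precompute_structure_matrix(max_len, k_struct):
    """Return K_S with K_S[i-1][j-1] = k_struct(i, j) for 1 <= i, j <= max_len."""
    return [[k_struct(i, j) for j in range(1, max_len + 1)]
            for i in range(1, max_len + 1)]


def sequence_kernel_trace(s, t, k_sym, K_S):
    """Evaluate tr(K_Sigma^T K_S) as a sum of elementwise products."""
    return sum(k_sym(s_i, t_j) * K_S[i][j]
               for i, s_i in enumerate(s)
               for j, t_j in enumerate(t))


def exp_struct(i, j, sigma=2.0):  # illustrative structure kernel
    return math.exp(-((i - j) ** 2) / sigma ** 2)


def linear_sym(a, b):  # illustrative symbol kernel
    return a * b


data = [[1.0, 2.0], [0.5, 1.5, 2.5], [1.0]]
K_S = precompute_structure_matrix(max(len(x) for x in data), exp_struct)
gram = [[sequence_kernel_trace(x, y, linear_sym, K_S) for y in data] for x in data]
print(gram)
```

Sharing one structure matrix across all pairs avoids re-evaluating the structure kernel
for every entry of the Gram matrix.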

In a typical scenario $k_S$ and $k_\Sigma$ might also depend on parameters
$\theta_S \in \mathbb{R}^n$ and $\theta_\Sigma \in \mathbb{R}^m$. These parameters can be
set through cross-validation but they can also be learned if the gradients of the kernel
functions with respect to these parameters can be computed. If we wish to use the kernel to
represent a functional relationship $f : s \mapsto y$, where $y \in \mathbb{R}^d$, we can
encode a preference over the mapping $f$ by a Gaussian process (Rasmussen and Williams,
2006). If the covariance in the output space is encoded by a sequence kernel $k$, the
parameters $\theta_\Sigma$ and $\theta_S$ can then be learned by maximizing the marginal
likelihood of the model. In order to accommodate classification in a Gaussian process
framework, the regression noise is usually squashed (Rasmussen and Williams, 2006),
rendering the integration required to reach the marginal likelihood infeasible. However, it
has been observed that learning the parameters for a classification task with 1-of-C
encoding, where each class is encoded using a binary variable, and a Gaussian noise
assumption works well in practice (Kapoor et al., 2009).


4. Examples of Sequence Kernels 


In this paper we will focus on kernels from the family in Lemma 3.1. A straightforward
approach to formulate such a sequence kernel would be to pick a familiar kernel $k_S$, where
$|i - j|$ determines the impact on $k$. This can be implemented by a stationary kernel $k_S$
such as an exponential kernel:

Corollary 4.1 For $\sigma > 0$, the function
$k_e : \mathrm{Seq}(\Sigma) \times \mathrm{Seq}(\Sigma) \to \mathbb{R}$ given by,

$$k_e(s, t) = \sum_{i=1}^{|s|} \sum_{j=1}^{|t|} k_\Sigma(s_i, t_j)\, e^{-\frac{\|i - j\|^2}{\sigma^2}}, \tag{3}$$

is a valid kernel on $\mathrm{Seq}(\Sigma)$ corresponding to the exponential structure
kernel on $\mathbb{N}$.

Figure 1: The above figure shows five different structure kernel matrices. The left-most
image depicts the exponential kernel while the remaining four show the structure
kernel $k_\Gamma$ for the path sequence kernel for varying parameter values. The values
are $C_d = \{0.3, 0.35, 0.35, 0.3\}$ and $C_{hv} = \{0.3, 0.33, 0.37, 0.37\}$. Changing the
parameters $C_d$ and $C_{hv}$ emphasizes different parts of the alignment, highlighting the
non-stationary structure of how the contribution of different parts of a sequence
is accumulated.


Similarly, we can now define kernels by defining a kernel $k_S$ on integers by restricting
any known kernel on $\mathbb{R}$ to the integers. Examples include the polynomial kernels
$k_S(i, j) = (ij + c)^d$, the perceptron kernel, etc. While the above kernel follows readily
from the definition of our class of sequence kernels, we would now like to focus on a
secondary viewpoint which, as we will show, also leads to a sequence kernel as in Lemma 3.1.
In (Baisero et al., 2013), the authors proposed a novel (dis)similarity measure
$k_p : \mathrm{Seq}(\Sigma) \times \mathrm{Seq}(\Sigma) \to \mathbb{R}$ which can be defined
most elegantly in a recursive fashion as,


$$k_p(s, t) = \begin{cases}
k_\Sigma(s_1, t_1) + C_{hv}\, k_p(s_{2:}, t) + C_{hv}\, k_p(s, t_{2:}) + C_d\, k_p(s_{2:}, t_{2:})
  & |s| \ge 1,\ |t| \ge 1,\ C_d \ge 0,\ C_{hv} \ge 0, \\
0 & \text{otherwise,}
\end{cases} \tag{4}$$


where $t_{2:}$ denotes the sequence obtained by removing the first symbol from
$t \in \mathrm{Seq}(\Sigma)$. The recursive formulation above can be interpreted as
accumulating information from all possible alignments of two strings. An alignment of two
strings $s$ and $t$ is defined by a path $\gamma$ through a matrix $M$ of size
$|s| \times |t|$ from element $[M]_{11}$ to $[M]_{|s||t|}$. Each path defines a different
alignment in terms of “stretches” of a sequence, see Figure 2. Each path is decomposed into
a series of simple operations which have a different effect, parametrized by $C_d$ and
$C_{hv}$, on the final similarity measure. The cardinality of the set of paths, and
therefore of the alignments, for two sequences $s$ and $t$ is the Delannoy number
$D(|s|, |t|)$.
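The recursion in Eq. 4 can be evaluated without enumerating paths by caching the value for
every pair of suffixes. The following is a minimal sketch, assuming the recursion reads
$k_p(s,t) = k_\Sigma(s_1,t_1) + C_{hv} k_p(s_{2:},t) + C_{hv} k_p(s,t_{2:}) + C_d k_p(s_{2:},t_{2:})$
for non-empty sequences and zero otherwise. The function name and the use of suffix indices
in place of explicit sub-sequences are illustrative choices.

```python
from functools import lru_cache


def path_kernel_recursive(s, t, k_sym, c_hv, c_d):
    """Evaluate the recursive path kernel with memoization over suffix pairs."""

    @lru_cache(maxsize=None)
    def k_p(i, j):
        # k_p(i, j) is the kernel between the suffixes s[i:] and t[j:] (0-based).
        if i >= len(s) or j >= len(t):
            return 0.0  # an empty suffix contributes nothing
        return (k_sym(s[i], t[j])
                + c_hv * k_p(i + 1, j)       # drop the first symbol of s
                + c_hv * k_p(i, j + 1)       # drop the first symbol of t
                + c_d * k_p(i + 1, j + 1))   # drop the first symbol of both

    return k_p(0, 0)


product = lambda a, b: a * b  # illustrative symbol kernel
print(path_kernel_recursive([1.0, 1.0], [1.0, 1.0], product, 0.5, 0.25))
```

With the cache there are at most $|s| \cdot |t|$ distinct sub-problems. For long sequences
an iterative table would avoid Python's recursion limit.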


In addition to the recursive formulation above, (Baisero et al., 2013) also showed that the
resulting function can be written in a similar form to Eq. 1 where $k_S = k_\Gamma$ and,

$$k_\Gamma(i, j) = \sum_{d=0}^{\min(i,j)-1} C_{hv}^{\,i+j-2-2d}\, C_d^{\,d}\,
  \frac{(i + j - 2 - d)!}{(i - 1 - d)!\,(j - 1 - d)!\,d!}. \tag{5}$$
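The closed form in Eq. 5 can be cross-checked against a direct accumulation of path weights.
The sketch below assumes Eq. 5 reads
$k_\Gamma(i,j) = \sum_{d=0}^{\min(i,j)-1} C_{hv}^{i+j-2-2d} C_d^{d} \frac{(i+j-2-d)!}{(i-1-d)!(j-1-d)!d!}$,
where $d$ counts diagonal moves, and compares it with a table that sums the weights of all
paths from position $(1,1)$ to $(i,j)$. The function names are hypothetical.

```python
from math import factorial


def k_gamma(i, j, c_hv, c_d):
    """Closed-form structure kernel: sum over the number d of diagonal moves."""
    total = 0.0
    for d in range(min(i, j)):  # d = 0, ..., min(i, j) - 1
        # Number of orderings of the horizontal, vertical and diagonal moves.
        n_paths = factorial(i + j - 2 - d) // (
            factorial(i - 1 - d) * factorial(j - 1 - d) * factorial(d))
        total += c_hv ** (i + j - 2 - 2 * d) * c_d ** d * n_paths
    return total


def path_weights(n, m, c_hv, c_d):
    """w[i][j]: summed weight of all paths from (1, 1) to (i, j); index 0 is padding."""
    w = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if i == 1 and j == 1:
                w[i][j] = 1.0  # the empty path
            else:
                w[i][j] = c_hv * (w[i - 1][j] + w[i][j - 1]) + c_d * w[i - 1][j - 1]
    return w


w = path_weights(5, 6, 0.3, 0.35)
print(max(abs(w[i][j] - k_gamma(i, j, 0.3, 0.35))
          for i in range(1, 6) for j in range(1, 7)))
```

The table follows the recursive view and the formula follows the combinatorial view, so
agreement between the two is a useful check on any implementation of either.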
Figure 2: The image above depicts the Path kernel interpreted as an alignment of paths.
Two sequences $s$ and $t$ and three separate alignments $\gamma_{1,2,3}$ are shown. The
matrix on the left shows how a subset of the possible alignments between $s$ and $t$ can
be constructed as paths from the top-left to the bottom-right element of the matrix. On
the right, the three different alignments generated from the paths on the left are shown.


While the recursive definition of $k_p$ is natural, since it assigns a cost for diagonal and
off-diagonal moves in a matrix, the above formula seems rather unintuitive. While (Baisero
et al., 2013) did not provide a proof of positive definiteness, we will now show that the
path kernel $k_p$ does in fact define a positive definite kernel and that $k_p$ naturally
falls into the class of kernels considered here.

In order to prove that $k_p$ is a positive definite kernel, we now need to show that
$k_\Gamma : \mathbb{N} \times \mathbb{N} \to \mathbb{R}$ is indeed a kernel on integers.
First let us recall that the Gamma function $\Gamma$ is defined by,

$$\Gamma(z) = \int_0^\infty t^{z-1} e^{-t}\, dt, \tag{6}$$

for $z \in \mathbb{C}$, $\mathrm{Re}(z) > 0$ and that we have $\Gamma(n) = (n - 1)!$ for
$n \in \mathbb{N}$. Let us now think of the factorial as a curious example of a positive
Mercer kernel on integers:


Lemma 4.2 Let $d \in \mathbb{Z}$ and $X_d = \{x \in \mathbb{N} : x \ge \frac{d}{2}\}$. The
function $k : X_d \times X_d \to \mathbb{R}$ defined by,

$$k(x, x') = (x + x' - d)!,$$

is a positive definite kernel on $X_d$ corresponding to the feature mapping
$\varphi_d : X_d \to L^1(\mathbb{R}_{\ge 0})$ mapping $x \in X_d$ to the function
$f_x(t) = t^{x - \frac{d}{2}} e^{-\frac{t}{2}}$. I.e., considering the standard inner
product on $L^1(\mathbb{R}_{\ge 0})$ given by
$\langle f, g \rangle = \int_0^\infty f(t) g(t)\, dt$ for two integrable functions
$f, g \in L^1(\mathbb{R}_{\ge 0})$, we have $k(x, x') = \langle f_x, f_{x'} \rangle$.

Proof The result follows directly from the above integral formula for
$\Gamma(x + x' - d + 1)$. ■
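The integral behind Lemma 4.2 can be spelled out in a few steps. The worked derivation
below assumes the feature map reads $f_x(t) = t^{x - d/2} e^{-t/2}$ and uses only the
integral definition of the Gamma function in Eq. 6; the coefficients $c_a$ and points $x_a$
in the second line are generic placeholders for the definition of positive definiteness.

```latex
\begin{aligned}
\langle f_x, f_{x'} \rangle
  &= \int_0^\infty t^{x - \frac{d}{2}} e^{-\frac{t}{2}} \;
     t^{x' - \frac{d}{2}} e^{-\frac{t}{2}} \, dt
   = \int_0^\infty t^{(x + x' - d + 1) - 1} e^{-t} \, dt
   = \Gamma(x + x' - d + 1) = (x + x' - d)! , \\
\sum_{a,b=1}^{n} c_a\, k(x_a, x_b)\, c_b
  &= \int_0^\infty \Big( \sum_{a=1}^{n} c_a f_{x_a}(t) \Big)^{2} dt \;\ge\; 0 .
\end{aligned}
```

The condition $x \ge d/2$ in the definition of $X_d$ keeps the exponent of $t$ non-negative,
so every integral above is finite.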
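Note that, if $C_d \ge 0$, we can use the idea of the above lemma and write
$k_\Gamma(i, j) = \sum_{d=0}^{\min(i,j)-1} \langle \varphi_d(i), \varphi_d(j) \rangle$,
where $\varphi_d : X_{2+2d} \to L^1(\mathbb{R}_{\ge 0})$ is the feature map mapping integers
to functions which is given by,

$$\varphi_d(i)(t) = \frac{C_{hv}^{\,i-1-d}\, C_d^{\,d/2}}{(i - 1 - d)!\,\sqrt{d!}}\;
  t^{\,i - 1 - \frac{d}{2}}\, e^{-\frac{t}{2}},$$

and $\langle f, g \rangle$ is again the inner product of functions obtained by integration.
$k_d(i, j) = \langle \varphi_d(i), \varphi_d(j) \rangle$ is a kernel on
$X_{2d+2} = \{x \in \mathbb{N} : x > d\}$. Note that the condition
$d \le \min(i, j) - 1$, $i, j \in \mathbb{N}$ in the summation appearing in the definition
of $k_\Gamma$ is equivalent to $d < i$ and $d < j$, i.e. $i, j \in X_{2d+2}$. Let us call
the extension of $k_d$ to $\mathbb{N} \times \mathbb{N}$ $\tilde{k}_d$, so that
$\tilde{k}_d(i, j) = 0$ if $i$ or $j \notin X_{2d+2}$ and
$\tilde{k}_d(i, j) = k_d(i, j)$ if $i, j \in X_{2d+2}$. Then
$\tilde{k}_d : \mathbb{N} \times \mathbb{N} \to \mathbb{R}$ is a positive definite kernel by
construction and we have,

$$\begin{aligned}
k_\Gamma(i, j)
  &= \sum_{d=0}^{\min(i,j)-1} C_{hv}^{\,i+j-2-2d}\, C_d^{\,d}\,
     \frac{(i + j - 2 - d)!}{(i - 1 - d)!\,(j - 1 - d)!\,d!}
   = \sum_{d=0}^{\min(i,j)-1} k_d(i, j) \\
  &= \sum_{d=0}^{\min(i,j)-1} \delta_{d<i}\, k_d(i, j)\, \delta_{d<j}
   = \sum_{d=0}^{\min(i,j)-1} \tilde{k}_d(i, j)
   = \sum_{d=0}^{\infty} \tilde{k}_d(i, j),
\end{aligned}$$

where $\delta_{d<i} = 1$ if $d < i$ and zero otherwise. For any finite set of integers, only
finitely many terms in the sum above are non-zero and the kernel $k_\Gamma$ is clearly
positive since it is a sum of positive kernels.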


Corollary 4.3 Let $k_\Sigma : \Sigma \times \Sigma \to \mathbb{R}$ be a continuous positive
definite kernel on $\Sigma$, where $\Sigma$ is a separable metric space. Then the associated
path kernel $k_p : \mathrm{Seq}(\Sigma) \times \mathrm{Seq}(\Sigma) \to \mathbb{R}$ is a
positive definite kernel for any $C_d \ge 0$ and $C_{hv} \in \mathbb{R}$.
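The positive definiteness claimed here can be sanity-checked numerically on the structure
kernel. The sketch below is a spot check, not a proof: it builds the Gram matrix of
$k_\Gamma$ on the integers $1, \dots, n$ with exact rational arithmetic and evaluates the
quadratic form $\sum_{a,b} c_a k_\Gamma(a, b) c_b$ for random integer coefficient vectors.
The function names, the parameter values and the number of trials are assumptions chosen for
illustration.

```python
import random
from fractions import Fraction
from math import factorial


def k_gamma(i, j, c_hv, c_d):
    """Structure kernel of the path kernel, written so that Fractions stay exact."""
    total = 0
    for d in range(min(i, j)):
        n_paths = factorial(i + j - 2 - d) // (
            factorial(i - 1 - d) * factorial(j - 1 - d) * factorial(d))
        total += c_hv ** (i + j - 2 - 2 * d) * c_d ** d * n_paths
    return total


def min_quadratic_form(n, c_hv, c_d, trials=200, seed=0):
    """Smallest value of c^T K c over random integer vectors c, computed exactly."""
    rng = random.Random(seed)
    gram = [[k_gamma(i, j, c_hv, c_d) for j in range(1, n + 1)]
            for i in range(1, n + 1)]
    smallest = None
    for _ in range(trials):
        c = [rng.randint(-5, 5) for _ in range(n)]
        q = sum(c[a] * gram[a][b] * c[b] for a in range(n) for b in range(n))
        if smallest is None or q < smallest:
            smallest = q
    return smallest


# A non-negative C_d with a positive and a negative C_hv.
print(min_quadratic_form(6, Fraction(3, 10), Fraction(3, 10)) >= 0)
print(min_quadratic_form(6, Fraction(-1, 2), Fraction(1, 4)) >= 0)
```

Exact rational arithmetic is used so that a negative value could not be attributed to
floating-point rounding.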


5. Experiments 


In this section we will experimentally evaluate the performance of the path sequence kernel
on a set of real sequential classification data-sets. However, to provide intuition for the
approach, we will first show how the proposed path sequence kernel represents two sets of
toy data. One of the motivations behind this work is to provide a vectorial embedding of
sequences of different length. To evaluate this, we generate 10 noisy sine and cosine curves
of different length as shown in Figure 3. We now wish to find an embedding that separates
the two classes of curves. By formulating the classification task as a regression problem to
a 1-of-C encoding and placing a Gaussian Process prior (Rasmussen and Williams, 2006) over
the mapping, we can learn the kernel parameters through a maximum-likelihood approach.
In Figure 3 the resulting embedding is displayed, clearly showing how the kernel manages
to separate the two classes. It is important to note that both the sine and the cosine curves
are generated over a full period, which means that the first order statistics for the curves
are the same. The discriminating information in the data is hence contained in the sequential
structure which the path sequence kernel extracts.
Figure 3: The above figure depicts the first toy data-set onto which we have applied the
path sequence kernel. The left-most image shows the first class which consists of
10 noisy sine curves of different length while the middle image shows the other
class which consists of noisy cosine curves. The dashed lines indicate the length
of the longest and the shortest curve in each data-set respectively. The right-most
image shows the embedding defined by the first two principal components given by
the trained kernel.


One of the benefits of the proposed decomposable kernel is that, by learning its
parameters, we can determine if the discriminating information is contained in the structure
or symbol level of the sequences. To evaluate this we generated a second toy data set, shown
in Figure 4. The data-set consists of 10 noisy sine and square waves. Half of the sequences
from each class have been altered such that 5 symbols at random places take the value 4. We
will conduct two experiments on this data. In the first we will learn the parameters of the
kernel so as to separate the sine from the square waveform, while for the second experiment
we want to separate the sequences containing the randomly positioned symbol 4 irrespective
of waveform. In the first experiment the structure of the symbols is much more important,
while in the second the only distinguishing aspect is that the sequence contains the symbol
4. This is reflected by the learned parameters: for the first experiment the coefficient for
making diagonal moves, $C_d$, which reflects the importance of the sequences being aligned,
is much higher compared to the second experiment, where there is little difference between
diagonal and horizontal moves, reflecting that the information is contained at the symbol
level of the sequences. In Figure 4 the embedding and the kernel matrices are shown.


We will now proceed to evaluate the performance of the different kernels on a set of
well known sequential classification data-sets from the UCI Machine Learning repository
(Bache and Lichman, 2013) with varying length, dimension and number of classes, see
Table 1. We compared three different kernels: the exponential, the path sequence and the
global alignment kernel (Cuturi et al., 2007). In each of the experiments the same
exponential kernel is used to represent the symbol space such that the kernels only differ in
how they represent the structural component of each sequence. Classification is performed
by applying a support vector machine (Chang and Lin, 2011) to the space induced by the
various kernels. The parameters of the kernels and the classifier are learned using nested
cross-validation with 3 outer and inner folds and 3 and 20 repetitions respectively. The
outer cross-validation iterates through different divisions of the data into training and
test sets. The inner cross-validation uses the training set to perform parameter selection,
using the established number of folds and repetitions. The chosen parameters are finally
used to test the model on the test set and the results are measured and averaged over the
previously described number of folds and repetitions.

Figure 4: The figure shows the second toy data-set presented in the paper. Each row
corresponds to a specific experiment setting. The first two columns show the two input
classes while the third shows the embedding defined by the first two principal
components and the last column depicts the learned kernel matrix evaluated on the
input data.

To make sure that the discriminative information in the data resides in the structure
and not only in the first order statistics of the symbol space, we tried to classify the
fixed length data by assuming each dimension to be independent. To do so we concatenated
each symbol in a sequence and used an exponential kernel and a Euclidean distance as a
similarity measure. The results are shown in Table [6]. As can be seen, the performance for
the Libras and the different PEMS data-sets is close to random, indicating that the
structure is indeed important. Comparing the three different kernels, we can see that the
path sequence kernel consistently outperforms the other two kernels. It is interesting to
see the significant difference in performance between the exponential and the path sequence
kernels. We argue that the difference in performance is due to the stationary
characteristics of the exponential kernel, where the influence of each symbol match only
depends on the difference in position. The path kernel has a more fine-grained
characteristic where the actual position of a match influences the similarity score. Both
the path and the global alignment kernel take all possible alignments into account. As we
see, the path kernel significantly outperforms the global alignment kernel. This can be
explained by the dominating influence of the “best” alignment in the global alignment kernel
compared to the path kernel, which more gracefully accumulates information from all possible
alignments into the final kernel value. This behavior also means that there is a strong
preference towards sequences of the same length for the global alignment kernel, which can
explain the big difference in performance compared with the path kernel on the AUSLAN
data-set. We believe this shows the value of a non-stationary structured kernel for
representing sequences. The path perspective provides an intuitive, rigorous and
interpretable formulation for designing such new kernels.

data-set      dim    length            #classes    #N
AUSLAN         22    45-136 (55)             95    2565
Libras          2    45                      15     945
PEMS100       963    144                      7     440
PEMS95        335    144                      7     440
PEMS90        171    144                      7     440
Vowels         12    7-29 (15)                9     640
Characters      3    60-182 (122)            20    2858

Table 1: The above table characterizes the UCI (Bache and Lichman, 2013) data-sets on
which our experimental evaluation is performed. From top to bottom the data sets
were presented in (Kadous, 2002; Dias et al., 2009; Cuturi, 2010; Kudo et al.,
1999; Williams et al., 2006). The columns from left to right show the dimensionality
of the symbol space, the range of lengths of the sequences with their median length
within brackets, the number of classes and the number of sequences. The three different
PEMS data-sets are projections of the data onto its principal components such that
100, 95 and 90 percent of the variance in the data is retained.


6. Conclusion 


In this paper we have presented an approach to combine kernels in a structured manner
such that the resulting measure represents a Mercer kernel. This leads to kernels that
provide a (dis)similarity measure between sequences of different length. In particular we
proved that a recently proposed (dis)similarity measure (Baisero et al., 2013) falls within
this family, which allows us to adopt the intuitive notion of alignments, which we believe
will provide useful insights for designing new kernels. We showed experimentally how the
path sequence kernel significantly outperforms previous state-of-the-art methods. In future
work we aim to further establish the path perspective and discuss how new kernels
which include higher-order paths, taking more than simple moves into account, can be
designed.